-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Major overhaul of mbstring (part 31) #10591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The new implementation is 2.5x-3x faster. If an invalid charset name was used, the old implementation would get 'stuck' trying to parse the charset name and would not interpret any other MIME encoded words up to the end of the input string. The new implementation fixes this bug. If an (invalid) encoded word ends abruptly and a new (valid) encoded word starts, the old implementation would not decode the valid encoded word. The new implementation also fixes this. Otherwise, the behavior of the new implementation has been designed to closely match that of the old implementation.
@@ -5641,3 +5613,232 @@ static void php_mb_gpc_set_input_encoding(const zend_encoding *encoding) /* {{{ | |||
MBSTRG(http_input_identify) = (const mbfl_encoding*)encoding; | |||
} | |||
/* }}} */ | |||
|
|||
static int8_t decode_base64(unsigned char c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can standard base64 decode function be used here? especially with #6361
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion. I'll have to check what its interface looks like and see if it fits what is needed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the functions provided by base64.c
take a pointer/length pair (good) and return a zend_string
(not what I wanted).
But maybe I can adjust the code to use a zend_string
. Let me think about that a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps it's possible to make the other code internally work with a pointer/length pair. And let the existing callers of that code work with a wrapper for zend_string that extracts the pointer & length.
It would be good to know what this function does. 😄 The PHP manual isn't very verbose https://www.php.net/manual/en/function.mb-decode-mimeheader.php |
@kamil-tekiela Thanks for the comment. This might be helpful: https://en.wikipedia.org/wiki/MIME; scroll down to the subheading "Encoded-Word". The MIME standards (defined in various IETF RFCs) allow e-mail header values to include non-ASCII characters, by encoding them in a special format. The input string for I briefly mentioned the relevant standard in the code comments:
|
FYI, there's also |
Thanks for pointing that out! I really wonder why we have two different built-in functions in PHP which do the same thing. |
Yeah. I don't know. It might be that |
I found blog that about talk PHP 8.0 and PHP 8.1 or later of Imitate this behavior, I confirmed different to this PR and PHP 8.1 or later and PHP 8.0. Code is below.
This PR result is |
Result of |
@MaxKellermann There is also the following known issue with this function: https://bugs.php.net/bug.php?id=68821 This issue makes it behave differently to So in order to properly handle MIME mail headers one must use this:
|
Thanks for pointing that out. I reviewed the relevant RFC and you are 100% right; I have added a fix to this PR. |
@youkidearitai As usual, thanks for the great review! You have uncovered an interesting issue here. I haven't completed my investigation yet, but one first point: the test code which you kindly provided is buggy. The documentation for
In the code sample, Here is a sample where |
@youkidearitai OK, more findings... In PHP 8.1, I adjusted the mbstring conversion code for some legacy text encodings to flag errors rather than ignoring them. For example, in ISO-2022-JP, various encoding modes can be selected using escape sequences like
How should we treat this string? The useless When PHP 8.0 mbstring processes this string, it ignores the trailing ESC byte and treats the string as valid. ( This is the reason why PHP 8.1 Now, why do we have an invalid escape sequence in the string returned by In ISO-2022-JP, So basically:
|
Solution #1 mentioned by Satou-san in the above-referenced blog post is the correct one:
He says in the post that when he tried this, he found some strings which were corrupted when passed through |
There's actually another solution which it seems Satou-san didn't think about: provide a UTF-8 header, and let mb_internal_encoding('UTF-8');
$str = "This is a UTF-8 string. あいうえお。您好。";
mb_encode_mimeheader($str, 'ISO-2022-JP'); |
Indeed, Satou-san only presentation some one, we can't know what to do. We tried many workaround because multibyte mail is often mojibake. |
Sorry to hear that. I am ready to fix anything which is wrong, but first, we need test cases which demonstrate the problem. You have provided one very nice test case, which demonstrates very well what can happen when If you, or anyone else, can come forward with such a test case, I am ready to analyze the root cause. Understanding the root cause is very necessary before anything can be done about it. |
Hmm. I was hoping that maybe @youkidearitai might come up with some other test cases related to the reports of mojibake, but it seems that none are forthcoming now... I am inclined to merge this PR "as is" and leave the issue of using the SIMD-accelerate Base64 decoder as a possible future enhancement. If anyone strongly feels we need SIMD-accelerated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know about mime header encoding, but looks reasonable.
Landed... thanks all. |
I'm sorry for waiting me. But mail mojibake know-how is almost closed part of company. The problems reported in php-src only a fraction... |
One more step towards eliminating the legacy encoding conversion code.
mb_decode_mimeheader
is now around 3 times faster.FYA @cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai